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Abstract 

In this paper we offer a new parallel algorithm for solving systems of linear algebraic equations with the same 
block-tridiagonal matrix but with different right-hand sides. The method under study is a generalization 
of the parallel Dichotomy Algorithm for solving systems of linear equations with tridiagonal matrices [H . 
Based on the approach developed, we propose a parallel realization of the domain decomposition method 
(the Schur complement method). The calculation of acoustic wave fields by the spectral-difference technique 
has proved the efhciency of the developed parallel algorithms. A near-linear dependence of a speedup value on 
the number of processors is attained both using several and several thousands of processors. The innovation 
of this study is in that the developed parallel algorithm for solving block-tridiagonal systems of equations 
allows and effective and, which is no less important, simple realization of efficient numerical procedures for 
solving engineering tasks on a supercomputer. 

Keywords: Parallel Dichotomy Algorithm, Block-tridiagonal matrices. Domain Decomposition Method, 
Laguerre Transform, Acoustic Solver, PML absorbing boundary condition 
PACS: 02.60.Dc, 02.60. Cb, 02.70.Bf, 02.70.Hm 



1. Introduction 

Solving systems of linear algebraic equations (SLAEs) is one of the main problems of computational 
mathematics. With the advent of multiprocessor computer systems it appeared possible to reduce to some 
extent computer costs. However in the course of investigation it became evident that most of efficient 
numerical methods cannot be effectively implemented for supercomputers with many processors. As super- 
computer performance is mainly increased at the cost of the join of a large number of processors, there arises 
a necessity to develop the new parallel numerical algorithms for solving SLAEs. 

When implementing many numerical techniques, it is required to solve SLAEs with block-tridiagonal 
matrices[ili,|3| 
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By now, various algorithms for solving problem ([T]) on a multi-processor computer system have been 
developed [3, @, 0, B S 113; U | ■ But for a multiple solution of SLAEs with the same matrix using the 
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Dichotomy Algorithm this procedure is possible to offer a parallel algorithm with a higher performance as 
compared to other approaches. The Dichotomy Algorithm is compatible with other algorithms, however it 
essentially benefits in terms of the time needed for interprocessor interactions. This results from the fact 
that when implementing the dichotomy process on a supercomputer it reduces to the calculation of a sum 
of a series for distributed data thus essentially decreasing the total computing timeT|. 

First the parallel Dichotomy Algorithm was designed for solving SLAEs with the same tridiagonal matrix 



but different right-hand sides. In [12[, the Dichotomy Algorithm was applied to solving SLAEs with Toeplitz 
tridiagonal matrices. It was shown that for Toeplitz tridiagonal matrices, SLAEs can be efi^ectively solved 
both with one and several right-hand sides. In the Dichotomy Algorithm was applied to implement 

a spectral-diff'erence method of calculation of acoustic and elastic wave fields. This made it possible to 
effectively use from 2 up to 8192 processors per one calculation and to obtain a highly accurate numerical 
solution of the dynamic problem of elasticity theory. Thus, all the above bears witness to the fact that the 
Dichotomy Algorithm for solving SLAEs with tridiagonal matrices is a powerful instrument of the numerical 
modeling. In this paper we propose the new parallel algorithm based on the Dichotomy Algorithm for 
solving problem ([T]). 

When solving many mechanics problems, algorithms based on the domain decomposition method are 
widespread [lii 

M m Such an approach has proved its efficiency for calculations on one-processor com- 
puters. However with parallel realization of the domain decomposition method, difficulties emerge due to 
the necessity of implementing efficient algorithms for solving SLAEs. The fact is, efficient methods are, as 
a rule, difficult to parallelize. We will show that the numerical procedure developed for solving problem ([ij 
will allow the effective use of the domain decomposition method (the Schur complement method) for the 
simulation of acoustic wave fields with thousands of processors. 

2. The Parallel Dichotomy Algorithm for block-tridiagonal matrices 

2.1. The central idea 

Introduce the following notations: 

• Denote by the matrix obtained from a matrix A by throwing off all rows and columns with the 
numbers less than I or greater than t. 

• Denote by {V}* the subvector obtained from a vector V by throwing off the components with the 
numbers less than / or greater than t. 

• Denote by e^ = (1, 0, 0, 0)'^ , e^^ = (0, 0, 0, 1)"^. 

Omitting unnecessary details, let us formulate a step of the dichotomy process for dividing system ([T|) into 
two independent subproblems by calculation of the element X^. 
Algorithm 1 

1. Calculate rows of the matrix P~'^ with numbers, where i = {K - 1)M + 1,{K - 1)M + 2, KM. 

2. Calculate the subvector 

3. Transfer from system ([IJ to two independent subsystems by modifying the right-hand side 

^p}iK-i)M ^^^iK-i)M ^ |F|(i^-i)M ^ ^ ^Bk-iXk) ,K>1, (2a) 

{P}^^+, m^Z+i - mKM+i + » {Ak+i1Lk) ,K<N. (2b) 
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Further a similar procedure is applied to independent subproblems (|2ap and (|2b|) . Thus, all the compo- 
nents from the solution vector will be calculated in [log2 N~\ steps. Rows of the inverse matrix are stored 
in the course of calculation and are not recalculated for each right-hand side. As a result, the Dichotomy 
Algorithm allows "multiplication" of a vector of the right-hand side by the matrix P^^ in 0(Af^A^log2 N) 
arithmetical operations, while the direct multiplication would demand 0{M^N^) operations. The number of 
arithmetical operations is decreased because when multiplying a vector by the matrix the information 
about the structure of the matrix P is used in the Dichotomy Algorithm. 

At this point, a consideration of the solution of problem ^ with the help of the Dichotomy Algorithm 
could be completed if there were no an essential complication: it is required to carry out 0(M'^N) arithmeti- 
cal operations as preliminary to the Dichotomy Algorithm in order to calculate rows of the matrix P~^0]. 
Such arithmetical costs for large M and N can be unacceptable. Moreover, each processor will require 
SM'^N RAM cells for storing a copy of the matrix P. The use of a supercomputer suggests the solutions 
of SLAEs of high orders, therefore it is necessary to decrease the required volume of RAM and to minimize 
the time of preliminary calculations, otherwise it will be impossible to use the Dichotomy Algorithm. 



2.2. An Improved version of the algorithm 

In Algorithm 1, the basic idea of dividing SLAEs with block-tridiagonal matrices is considered. If the 
number of processors exceeds the order of the matrix, then auxiliary values f3^'^, Z^'^ are introduced [ll. [l2j . 
But as was noted above, such an approach for block-tridiagonal systems requires high computer costs because 
its implementation requires solving the original equations system on each processor. Let us explain how to 
overcome this difficulty. 

a parallel algorithm based on the superposition principle for solving tridiagonal SLAEs, is 



proposed. Its central idea is in that the original SLAE reduces to a system of linear equations with a 
tridiagonal matrix of order p, where p is the number of processors. In order to calculate the matrix with 
the reduced system of equations, on each processor it is necessary to preliminarily solve local subsystems of 
N/p equations, where N is dimension of a tridiagonal SLAE. After solving the reduced system of equations 
all the components of the solution vector are independently calculated on each processor. 

A similar approach to solving SLAEs with block-tridiagonal matrices is considered in Q . It consists in 
the following. The solution to the original system of equations is expressed through Mp of whilst unknown 
components from the solution vector (Fig. [T]) : 

X, = (UiUf ...Uf) Xk + (V|Vf ...Vf) Xk+l + W, = U,%K + V,%k+l + W„ 

K = l,L + l,2L + l,...,{p-l)L + l- ie[K,K + L); L = N/p, (3) 

Xat+i = 0, 

where the matrices Ui,Vi G 9^mxm g^^^ ^^ic vector are defined from the solution to subproblems 
-A,Ui_i + aiJj - B,Ui+i =0, U]^ = ei, U}^+i = 0, 

(4a) 



-AVi_i + C.yi - A;Vi+i = 0, V], = 0, V],+^=ei, 
(4b) 

-A,W,_i + QW, - B,W,+i = F„ Wk = 0, Wk+l = 0, (5) 
where e„ is a unit vector in the space 
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Figure 1: Components of the solution vector to be calculated for dividing the original equations system into subproblems. 



From (HJ,® we obtain that values of the components X^c, K ^ 1, 2L + 1, SL + 1, {p — 1)L + 1 can be 
determined from solving a three-point system of vector equations. 

— [ArUk-i] Xa'_l + [Ck — AkVk-1 — BkUk+i] Xjc — [BjfVfc+i] '^k+l = 

< =Fa' + Aa-Wa--i+BkWk+i, K = l,L+l,2L + l...,{p-l)L + l, (6) 

Denote system ^ as PX = F. For solving system ([5]) the Dichotomy Algorithm can be applied more 
effectively than for solving system ([T]). This is due to the fact that reduced system ^ has the dimension 
Mp, while that of original problem ^ is MiV, where N > p. As a result, less computer time is needed for a 
preliminary to the Dichotomy Algorithm as well as a lesser RAM volume (SAf^iV vs. 3M^p). Thus, instead 
of Algorithm 1 one should use the following algorithm: 

Algorithm 2. 

1. The preliminary computations is carried out once for all the right-hand sides. 

1.1 Solve subproblems (|la|) . (j4bp independently on each processor. 

1.2 Calculate entries of the matrix P from ^ and send them to all the processors. 

1.3 On each processor calculate the required rows of the matrix from ^ (for Algorithm 1). 

2. The stage of calculating solutions is carried out for each right-hand side. 

2.1 On each processor solve independently subsystem ([5]). 

2.2 Solve system ([5]) by means of the Dichotomy Algorithm (Algorithm 1). 

2.3 In line with ([3]) calculate all the components of the solution vector. 

At the preliminary step to Algorithm 2 it is required to solve subsystems (|la|) , (j4bp . At this stage 
computer costs are about O [M^N/p^ arithmetical operations. In order to solve system ©, Algorithm 1 is 
used. Therefore it is needed to carry out O {Ad^p) arithmetical operations for calculation of necessary rows 
of the matrix . As entries of the matrix P are distributed among different processors, the calculation of 
required rows of the matrix P^^ will require interprocessor interactions. The time needed for interprocessor 
interactions for distributing copies of the matrix P among all the processors will be [20| 

Tcomm = a l0g2 P + ^PM^, 
P~ 1 

a-latency, /3-transfer time per byte. 

At the second stage of Algorithm 2, computer costs for solving system ([5]) and implementing ([3]) will be 
about O {^A'PN Ip^. Here the matrix of system ([5]) is assumed to be pre-factorized, and the matrices Ui, Vi, 
were computed at the preliminary step. Computer costs of Algorithm 1 for solving equation ^ are equal to 
O (M^ log2(p)) . Communication costs at the stage of solution calculation are conditioned by the dichotomy 
process and are estimated as [l| 

TL„„, « aloglip) + 4M2 log2(p)/3. 

In addition let us note that the necessary volume of RAM at this stage will be O [A/'PN/p ~\~ log2 p) , 
while at the preliminary step it makes O {A/PN/p + Ad^p). 



Depending on the algorithm of distribution. 
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2.3. Numerical experiments 

Let us consider the problem of solving a system of linear equations of the form of ([1]) with dimensions of 
blocks M = 60, 150 for TV from 2048 up to 65536. 

Numerical procedures were implemented in Fortran-90 using MPI library, calculation being performed 
on "MBC-lOOk" supercomputer of the Interdepartment Supercomputer Center of the Russian Academy of 
Sciences, (the 62-nd position in Top-500[2l|. November 2010). The results of the experiments conducted are 
given in Tables [TT2l and in Figl2j 
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Table 1: Preliminary time (Pre) and execution time (Exe). 2^^ = 2048. 
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Table 2: Preliminary time (Pre) and execution time (Exe). 2^* = 16384. 



Based on the data obtained let us note the following: 

• In all the test calculations, the value of dependence of the speedup value on the number of processors 
was near linear. 

• For matrices with M = 60-blocks, starting with a certain p > po the dependence of the speedup value 
on the number of processors was superlinear. This is due to increasing a general number of processors 
which, in turn, causes a decrease in the data volume of the problem for one processor. Thus, this 
allows a more effective use of a high-speed cache memory. A similar effect was achieved with a parallel 
realization of the ADI method^. 

• The preliminary time depends on the number of processors used. With a minor amount of processors, 
the main costs fall on solving problems (|la|) . (j4b[) and decrease with the growth of the number of 
processors. But starting with a certain p > po, preliminary costs for Algorithm 1 that are required for 
a subsequent solution of problem ^ become dominating. 

• For matrices with M = 150-blocks and the number of processors p = 2048 it appeared impossible to 
carry out preliminaries in a reasonable time. This is because the matrix of the reduced system has 
M'^P dimension and with p — 2048 processors cannot be completely located in RAM of one computer 
unit. The use of disk memory has considerably decreased the performance. 
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Figure 2: Dependence of the speedup value on the number of processors for M = 60 (a) and M = 150 (b) on various N. 



• For the parameters M = 150, TV = 2^^, p = 16,32 and M = 150, N = 2^^, p = 16 the insufficient 
volume of RAM because of a smaU number of processors did not allow solving problem (Pa|) . (|4b)) in a 
reasonable time. 

The numerical experiments have shown that the Dichotomy Algorithm provides a high efficiency of using 
supercomputer resources. When realizing the Dichotomy Algorithm in terms of numerical procedures one 
should pay attention to available volume of RAM, because as compared to iterative techniques both for the 
Dichotomy Algorithm and for most of direct methods of solving SLAEs a larger volume of RAM is needed. 

3. Acoustic Solver 

To gain greater insight into the Dichotomy Algorithm efficiency for solving applied problems of numerical 
modeling, in the cylindrical coordinate system (r, z), in the half-space z > we will consider the problem of 
modeling the propagation of acoustic waves from a point source 

p(x)^(x,i)=V[K(x)Vp(x,t)] + ^^^^i-i^/(t), t>0, x = (r,z), (7) 



where p(x, t) is the acoustic pressure, /c(x) is the density perturbations, k(x) /p(x) is the sound velocity, 
xq is the source coordinates. Suppose that problem ([7]) is solved with homogeneous initial conditions. 

A parallel version of the spectral-difference method for solving ([7]) was considered in 1^ 14 1 . The Laplace 



operator was selected as preconditioning operator. This allowed us to provide a high rate of convergence for 
media with moderate contrast. The use of the Dichotomy Algorithm for solving tridiagonal SLAEs made 
possible to attain a high calculation rate. However when a medium model includes zones of high and relatively 
low velocities, using the Laplace operator as preconditioning does not provide a high convergence rate of 
the iterative process for solving SLAEs. If it appears possible to distinguish macro-zones in the medium 
model, where the sound speed is constant or is slightly diverse, then it makes sense to use the domain 
decomposition Method. Parallel versions of the domain decomposition method were pro posed rather a long 



time ago 22|, |23| and recently algorithms with graphics accelerators have been offered 24]. In this paper, the 
domain decomposition method based on the Schur complement will be used for decreasing the number of 
arithmetical operations for solving difference equations but not as parallelization instrument. Thus, in the 
case under consideration the number of processors used and the number of subdomains will be independent 
values. The efficiency of using supercomputer resources will be completely provided at the cost of employing 
the Dichotomy Algorithm. 
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Figure[31a. presents a medium model, for which it is reasonable to use the domain decomposition method. 
When solving applied geophysics problems it is often required to calculate a wave field with an arbitrary 
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Figure 3: Medium model (a) and solution mesh (b). 



geometry of the free surface [25|, therefore such the medium model will include a relief. 

In addition, to exclude non-physical reflections from the fictitious boundary W4 in the subdomain $74, 
the PML absorbing boundary conditions will be realized (26l 27 1 



(i +-^)P- Poc' M + (qr) + ^) = 0, 



^ - —-^v = 

dt pQ dr^ ' 



(8) 



(t^+l)cp 



lot 



-C-PML 



X IS a user- 



where the absorbing layers profile is given by the function <Tz{z) — 2Lpml 

tunable reflection coefficient, v is the degree of the polynomial attenuation, Cp is the wave velocity, ipML is 
a width of PML region. 

3.1. The Laguerre transform 



Let us seek for a solution to problem ([7]) as a Fourier series in the Laguerre functions 28 1 

CO 

p(x,0 = (^i)^ ^P™(xK(77i), (9) 

m=0 



where l'^{r]t) are the orthonormal Laguerre functions[20|, m is Laguerre polynomial degree, a is the order 
of Laguerre functions and 77 is the transformation parameter. Applying the Laguerre transform to ([7]), we 
obtain a series of problems for defining the expansion factors 



1 V [ni^] VP,„(x)] - p(x)^P„(x) = -J^'J^U + pi^W^/^^EZoi^ - k) 
I ^ ^ on 7, ^ = on W7, P,„ = on ujq, 



Applying the Laguerre transform to equations ([5]) and introducing the notation 



M£^P,(x) in uUn. 

(10) 



(m + a + 1)! 



k=0 
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we obtain the following system of equations 



1 + ^ 



(11) 



1 dP„, 
po dr 



1 dP„ 
po dz 



Here Rm, -^m, ^m, Qm are expansion factors in the Fourier-Laguerre series for the functions Vr, Vz,p, q- 

3.2. Domain decomposition 

In the domain VI — IJ^^q fii (Fig- ISl^) and Vli H j = when i ^ j introduce a rectangular mesh 
^ (Fig. Ob). Inside the domains il^, i = 1,2,3 on the mesh ^ approximate problem PH)) . and in the 
subdomain 5^4 approximate equation (1111) for the PML absorbing boundary conditions. 

As the approximation of elliptic equations is widely covered in the literature 3^, 3l|, , we will only 
mention that for solving equation (1101) a five-point scheme of second order of accuracy was used that was 
constructed by the finite volume method. For solving equation (jlll) , we made use of the scheme of second 
order of accuracy on the staggered mesh(Fig. |31b). 

To reduce the dependence of the number of arithmetical operations on the contrast of the medium, let 
us dwell on the domain decomposition method based on the Schur decomposition 15, [l^. To this end 
the mesh nodes are enumerated in the following order: first the nodes from fii, f22, f^a, 5^4, and then those 
belongin g to the boundaries ^2,^3,^4. Then the difference problem for equations (jlOp . ljlip is written down 
as SLAEIll fH 
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(12) 



where each represents the subvector of unknowns that are interior to subdomain il^ and xr represents 
the vector of all interface unknowns. 



The matrix Ajj corresponds to the difference problem for equation (|10p in the interior of the subdomain 
ilj, J = 1, 2, 3, while the matrix A44 - for equation (jlip . 

First let us calculate components belonging to the boundaries ^1^2,3. To this end we solve the system of 
equations 

5xr = fr - Eti (13) 

where the Schur complement S is defined by S* = Arr ~ St=i ^rA'ii' Ai^. 

Once Xr is determined, the complete solution in the interior of the subdomains is obtained from 



x, = Al^ (f, - A.rxr) , for i = 1, 2, 3, 4. 



(14) 



For the matrix S be calculated not in the explicit form, we use the conjugate gradient method(the 
CG method) |33| for solving problem (IT51) . To implement the CG method, it is necessary to solve the two 
problems. The first one is in that multiplication of a vector by the matrix S requires parallelization of efficient 
procedures for the multiple inversion of the matrices An. In addition, the matrix S is ill-conditioned, hence 
it is required to use the preconditioning procedure. Further we will show that for solving such subproblems, 
one can build efficient parallel procedures based on the Dichotomy Algorithm. 



3.2.1. Multiplication of a vector by the matrix S 

It is evident that the main computer costs are required for the multiple inversion of the matrices An , i — 
1,2,3, 4. These problems are considered to be uniformly distributed among p processors according to Fig. [21b. 
Consider parallel procedures for multiplication of a vector by the matrices {^A^y Al^^ Aiy) , i — 1,2,3,4. 



a. Solution to elliptic equations in the subdomain ili. The difference problem for equation ([u 
in the subdomain fii is in agreement with a system of linear algebraic equations with the matrix An. An 
arbitrary geometry of the free surface 7 can be taken into account in different ways: irregular grids, the 
method of Lagrange multipliers, the method of fictitious domains S^, 3^ 3^, conformal mapping [33|- In 



14| it was shown that for calculation of wave fields for long durations of time one should use grids with 
a high spatial resolution hr^z ~ l/200Amira l/100Ami„, where Xmin is a minimum wavelength. If the 
sound velocity close to the free surface is not high, then due to necessity of using a small mesh size the 
free boundary 7 can be smoothed along the boundaries of the nearest cells. In practice, an admissible error 
for defining the depth of layers bedding, for example for the West Siberia region, makes up about several 
meters, that is why the mesh size equal to a few centimeters allows approximating with a sufficient accuracy 
the relief. Such an approximation makes possible to carry out calculations sufficiently fast, which is more 
reasonable in terms of efficiency. To make use of the approach in question, we apply an algebraic version of 
the method of fictitious domains, that is the fictitious components technique 3^ IH, whose idea is in that 



a subvector 0o being the solution to SLAE with a positive semi-definite matrix 

(15) 
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will also be the solution to the system Ancfio = /g. Let a matrix C correspond to the difference problem 
for the operator Lh = Ah — <P, d € 9\ in the subdomain JIq U Then system ([T5|) can be solved by 
the GMRESffc) method with the preconditioning matrix C per the number of iterations independent of the 
mesh size f3q|. In this case, the main macro-operation is in the inversion of the operator Lh thus allowing 
the use of the Dichotomy Algorithm for the effective parallelization. 

b. Solution to elliptic equations in the subdomains 1^2, 3- Let matrices A22 and A33 correspond to 
the difference problem for equation (1101) for the subdomains ^2 and fl^, respectively. To multiply a vector 
by the matrices ^^2^, ^1^3^ , it is possible to use the method of separation of variables [40j with arithmetical 
operations costs O(A'^logiV), where iV is a common number of mesh nodes in the subdomain. However it 
appears possible to calculate the product of a vector by the matrices Ajj^A^^Air, i = 2, 3 with essentially 
lesser arithmetical costs. For the difference problems Any = Airf, i = 2,3 the mesh function Airf takes 
nonzero values only in boundary nodes of the subdomain. For this right-hand side the calculation of the 
direct Fourier transform will demand only 0{N) arithmetical operations (40j|. To multiply a vector by the 
matrix Ajj^, it is sufficient to calculate components from the vector y that correspond to boundary mesh 
nodes for the subdomain. The inverse Fourier transform for defining the solution only in the boundary mesh 
nodes can be carried out in 0{N) arithmetical operations. Taking into account the fact that the solution 
to tridiagonal SLAE in terms of the method of separation of variables will demand 0{N) arithmetical 
operations, the final assessment of the number of arithmetical operations for multiplying a vector by the 
matrix A^A^^A^r will be 0{N). 

c. Solution to elliptic equations in the subdomain ft^. Let a matrix A44 correspond to the 
difference problem for the PML equations (jll|) in the subdomain il4. The width of a PML region is, as a 
rule, found within the limits of 20 up to 50 mesh nodes, while the number of cells in the radial direction is 
considerably larger, that is Nr 3> N^- With the above enumeration of unknowns, the matrix A44 will be 
a band matrix of order 3NzNr with the bandwidth 3Nz, where factor 3 is conditioned by the necessity of 
computing the three components Vr,Vz,p for the PML region. As Nz ^ Nr and with allowance for the ill 
conditioning of the matrix A44 , for multiplying a vector by the matrix ^144^ it seems reasonable to use the 
Dichotomy Algorithm (Algorithm 2) for block-tridiagonal matricefl 

Thus, all the procedures of solving the local subproblems include the Dichotomy Algorithm. With 
allowance for the results of computer experiments from the previous section, one should expect that the 
dependence of the speedup value on the number of processors for the multiplication of a vector by the 
matrix S will be close to the linear one. 



^For the band matrix, the submatrices Ai,Bi from ((TJ are upper triangular and lower triangular matrices, respectively. 
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3.2.2. Preconditioning 

By now there have been developed relatively many sequential versions of preconditioning procedures 
for solving problem (jl3p 15l . Il6l . |41| . However for supercomputers a class of effective preconditioners is 
essentially less. In this paper, we use a preconditioner based on the probing technique 42, 15, [l^. The 



operator S is approximated by an operator B on a, certain subspace, the latter being constructed so as B 
be readily invertible. In this case the matrix B will be a band one. The probing technique does not demand 
the knowledge about the structure of the operator S and is a purely algebraic approach. To calculate 
the matrix i3, one should realize the multiple multiplication of the matrix S by specially selected vectors 
Pi, 1 < ' < 2d + 1, where d is the bandwidth of the matrix B. This procedure was discussed in the previous 
section. As the bandwidth is essentially less than the order of the matrix B, it appears possible to use the 
Dichotomy Algorithm for block-tridiagonal matrices when solving SLAEs with a band matrix B. 



3.3. Numerical experiments 

Let the size of the computational domain be Lr = 7km, = 1.5km. A point source is located on the 
symmetry axis at a depth of 15m from the free surface; the time dependence being given as 



f{t) = exp 



9' 



sin(27r/o(i-io)), (16) 



where /o — SOHz, to = 0.2s, g — 4. The number of addends in series © was n = 6000; the expansion 
parameters were a = 5, rj — 1800. For the PML boundary conditions, the following parameters were 
selected: Lpml = 30/iz, Cp = 4400m/s, ly = 2, x= lO"*". 



The issues concerning the spectral algorithm based on the Laguerre transform were studied in 13|, |lj] , 
therefore we will dwell on performance and efficiency of the parallel algorithm. 

The matrices S and B are not spectrally equivalent, therefore with decreasing the mesh step the number 
of iterations of the conjugate gradient method for solving ([T^ will increase [421 • However increasing the 
bandwidth of the matrix B, denoted as d makes possible to decrease the number of iterations (Table |3]). 
With a twofold decrease of the mesh step the value of the parameter d should twofold be increased for 
the number of iterations of the CG method be not increased. Thus, preconditioning based on the probing 
technique allows a considerable decrease in computer costs, while the Dichotomy Algorithm makes possible 
to efficiently solve SLAEs with the preconditioning matrix. 

X Nr 3046 X 16384 6078 x 32768 12134 x 65536 



d 


Gen 


Total 


Iter 


Gen 


Total 


Iter 


Gen 


Total 


Iter 


no prec 




370 


4000 




1550 


5600 




13900 


9710 


3 


0.75 


2.28 


10 


1 


13 


6.6 


4.51 


40 


16 


5 


0.57 


2 


9 


1.6 


4.11 


5.6 


8.3 


38 


14 


7 


1.3 


1.9 


8 


2.14 


5.61 


10 


17 


36 


13 


11 


1.6 


1.2 


6 


3.43 


5 


9 


17.5 


34 


11 


21 


2.8 


1.13 


5 


6.61 


4.45 


7 


33 


32 


9 


51 


6.9 


0.88 


3 


16.24 


3.6 


4 


87 


27 


6 


101 








34 


3.54 


3 


166 


26 


4 



Table 3: (Gen) is the time of calculation of the preconditioning matrix B. (Total) is the time needed for solution to one 
problem of the form of II13II . In this case it is necessary to carry out (Iter) iterations, (d) is the bandwidth of the preconditioning 
matrix, the number of processes being constant p = 256. 

A feature of the parallel algorithm proposed is in that before solving a series of problems ((TU)l . ([TT]) . it 
is required to conduct preliminary calculations. The time assessments for preliminary calculations (P) are 
given in Table |31 which also represents the preliminaries for the Dichotomy Algorithm for inverting the 
matrices An, i = 1,2,3,4 as well as the costs for the calculation of the preconditioning matrix B. The 
number of terms in series ^ for long time durations makes up several thousands, that is why the time 
needed for the preliminaries can be neglected. This is because of their smallness as compared to the general 
computation time (T). 



10 




Figure 4: Snapshots for the wave field at i = 3s (a) for model pic|3]a and same model with additional low-velocity layer (b). 
Nz X Nr = 12134 X 65536. 



The smaller speedup coefficient (S) (Table |4|) as compared to the Poisson equation solution[l] is due to 
the necessity of complementary interprocessor communications for the GMRES(/j) method for the matrix 
All inversion. Moreover, the multiple inversion of the preconditioning operator C in the interior of the small 
subdomain fig U fii causes an increase in the communication time as related to the computation time and, 
hence, the scalability of the parallel algorithm decreases. 

In Section [3.2.11 it was shown that the multiplication of a vector by the matrices Aj^A^^ Air, i — 2,3 
can be done in 0{N) operations instead of 0{N log N). A similar situation arises when multiplying a vector 
by the matrix Aj-^A^^ Air- This is explained by the fact that when solving the problem Any = Airi 
an essentially lesser number of iterations of the GMRES(fc) method is required as compared to ^ny = f . 
To solve the equation with the matrix An 13 iterations of the GMRES(k) method for the right-hand side 
f were used, while for Airi, the number of iterations was 1. Thus, each iteration of the CG method for 
solving problem (jl3p demands an essentially lesser number of arithmetical operations than one would use 
the Laplace operator as preconditioner for the whole computational domain. Moreover, in the latter case 
the number of iterations would be essentially larger due to a high contrast of the medium. 



3046 X 16384, 6078 x 32768, 12134 x 65536, 



NP 


P 


T 


S 


P 


T 


S 


P 


T 


S 


32 


42 


9 




179 


38 










64 


20 


4.5 


64 


86 


17.2 


70 








128 


20 


2.1 


137 


60 


9 


135 


379 


53 




256 


23 


1 


288 


46 


4.5 


270 


197 


27 


251 


512 


34 


0.65 


443 


56 


2.1 


579 


383 


13 


521 


1024 


60 


0.56 


514 


83 


1.52 


800 


281 


8 


848 



Table 4: P is the total time of the preliminaries, T is the time of computing one harmonic from lO, S is the speedup value, d 
is the bandwidth of the precondition matrix, NP is the number of processors, Nr,Nz is the number of mesh size towards -R 
and Z, respectively. 
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The Dichotomy Algorithm at ah the stages of solving problem ([7]) provides a high performance and 
scalability of the proposed parallel algorithm. This allows us to carry out engineering calculations (Fig. U) 
based on efficient algorithms with the use of thousands of processors. It should be noted that the most 
efficient and at the same time difficult for parallel realization numerical methods are used. 

4. Conclusion 

In this paper the new parallel algorithm for solving SLAEs with the same block-tridiagonal matrix but 
different right-hand sides is proposed. To demonstrate the efficiency of the approach proposed, a problem of 
modelling the acoustic wave fields by the spectral-difference algorithm has been solved. A high performance 
of the Dichotomy Algorithm allows an effective use of the domain decomposition on a supercomputer. It 
should be noted that the domain decomposition was realized not for providing the parallel computation, 
but for decreasing the total number of arithmetical operations. In our case, the number of processors 
and subdomains are independent quantities, therefore the rate of convergence of the iterative method is 
independent of the number of processors. To solve the system of equations for the PML boundary conditions, 
the Dichotomy Algorithm was used. 

To reduce the total computation time, the probing technique was used as preconditioning procedure. 
The probing technique in the context of parallel algorithms has not been widespread by now due to the 
necessity of solving SLAEs with band matrices. The efficient inversion of such matrices with the use of 
supercomputer systems is a non-trivial task. However, the development of the Dichotomy Algorithm has 
allowed one to overcome this difficulty. Now, this type of a preconditioner can be successfully implemented 
on supercomputers. 

The numerical experiments carried out with 16 up to 2048 processors have proved the efficiency of the 
approach proposed. The dependence of the speedup value on the number of processors appears to be near- 
linear. Thus, a high performance and simplicity of service of the Dichotomy Algorithm allow one to include 
it into already existing sequential numerical procedures for their parallelization. 
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